Add coordinator module and corresponding unit test#1465
Add coordinator module and corresponding unit test#1465NumberWan wants to merge 14 commits intovllm-project:mainfrom
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1dd914a75a
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
vllm_omni/distributed/omni_coordinator/omni_coord_client_for_hub.py
Outdated
Show resolved
Hide resolved
8526123 to
3a85b30
Compare
… test Signed-off-by: Jeff Wan <wantszkin2003@gmail.com>
f6c4544 to
cda6939
Compare
Signed-off-by: Jeff Wan <wantszkin2003@gmail.com>
Signed-off-by: Jeff Wan <wantszkin2003@gmail.com>
lishunyang12
left a comment
There was a problem hiding this comment.
Left a few comments. Main concern is the TOCTOU race in _handle_event — the check-then-act on self._instances without holding the lock can cause duplicate registrations under concurrent events. The rest are smaller cleanups (license headers, type annotations, hardcoded test ports).
…types, dynamic ports Signed-off-by: Jeff Wan <wantszkin2003@gmail.com>
…types, dynamic ports Signed-off-by: Jeff Wan <wantszkin2003@gmail.com>
…types, dynamic ports Signed-off-by: Jeff Wan <wantszkin2003@gmail.com>
hsliuustc0106
left a comment
There was a problem hiding this comment.
PR #1465 Review: Add coordinator module and corresponding unit test
Overview
This PR implements the OmniCoordinator module for distributed stage instance coordination (Major PR 1 from #984). The implementation closely follows the design document with a few important deviations.
Critical Issues: 4 found
1. Instance Registry Never Removes DOWN Instances (Design Deviation)
File: omni_coordinator.py:271-276
The design document states instances should be "removed from the internal registry" when DOWN, but the implementation only marks them as DOWN and keeps them forever:
def _remove_instance_locked(self, event: InstanceEvent) -> None:
info.status = StageStatus.DOWN # Only marks, never removesImpact: Memory leak, unbounded registry growth over time.
2. ZMQ Context Singleton Causes Test/Production Issues
Files: omni_coordinator.py:51, omni_coord_client_for_stage.py:27, omni_coord_client_for_hub.py:28
Using zmq.Context.instance() shares a global context across all components:
self._ctx = zmq.Context.instance() # Global singletonImpact: Tests interfere with each other, closing one coordinator can break others, cannot run multiple coordinators.
Recommendation: Use per-coordinator context: self._ctx = zmq.Context()
3. No Connection State Machine / Reconnection Logic
Files: omni_coord_client_for_stage.py, omni_coord_client_for_hub.py
The design mentions reliability with retry mechanisms, but there's no reconnection logic if the ZMQ connection drops. Network blips will cause silent failures with no recovery path.
Recommendation: Add connection state machine with exponential backoff reconnection.
4. Scalability Limit Not Documented
The single-threaded coordinator has inherent scalability limits:
| Instances | Messages/sec | Status |
|---|---|---|
| < 500 | < 150 | ✅ OK |
| 500-2000 | 150-600 | |
| > 2000 | > 600 | 🔴 Bottleneck risk |
Recommendation: Add max_instances parameter or document expected limits.
Strengths
- ✅ Good test coverage matching design spec
- ✅ Message schemas match design document exactly
- ✅ Clean separation of concerns (messages, coordinator, clients)
- ✅ Thread-safe with proper locking
- ✅ ZMQ best practices (RCVTIMEO, NOBLOCK)
Recommendation
Address issues #1 and #2 before merge. Issue #3 (reconnection) could be documented as Phase 2 work if needed.
vllm_omni/distributed/omni_coordinator/omni_coord_client_for_stage.py
Outdated
Show resolved
Hide resolved
Signed-off-by: Jeff Wan <wantszkin2003@gmail.com>
Signed-off-by: Jeff Wan <wantszkin2003@gmail.com>
…ontexts Signed-off-by: Jeff Wan <wantszkin2003@gmail.com>
… when disconnected Signed-off-by: Jeff Wan <wantszkin2003@gmail.com>
… when disconnected Signed-off-by: Jeff Wan <wantszkin2003@gmail.com>
lishunyang12
left a comment
There was a problem hiding this comment.
All threads resolved LGTM
|
All my previous comments are addressed — thanks for the quick fixes. I'll defer to @hsliuustc0106 on the remaining items from their review (instance removal, reconnection logic). |
Part of #984 - adds OmniCoordinator module
Summary
Add OmniCoordinator module for distributed stage instance coordination.
Related
Part of #984 - this PR implements the coordinator module.
ADD